The minimax theory for estimating linear functionals is extended
to the case of a finite union of convex parameter spaces. Upper and
lower bounds for the minimax risk can still be described in terms of a
modulus of continuity. However, in contrast to the theory for convex
parameter spaces, rate optimal procedures are often required to be
nonlinear. A construction of such nonlinear procedures is given. The
results developed in this paper have important applications to the
theory of adaptation.



1. Introduction. Let $Y$ be an observation from either the white noise
model,

(1) $dY(t) = f(t)\,dt + n^{-1/2}\,dW(t)$,

where $W(t)$ is a standard Brownian motion, or the Gaussian sequence model

(2) $Y(i) = f(i) + n^{-1/2} z_i$,

where $z_i$ are i.i.d. standard normal random variables.

The minimax theory for estimating a linear functional $T$ has been studied
in great generality when it is assumed that the function $f$ belongs to a
parameter space which is convex. See, for example, Ibragimov and Has'minskii
(1984), Donoho and Liu (1991a, b) and Donoho (1994). In particular, the
properties of the minimax linear estimators can often be described precisely.
In this case, for any linear functional $T$ write $R_A^*(n;\mathcal{F})$ for the
minimum (over all linear procedures) maximum mean squared error. Donoho and Liu
(1991a) introduced a modulus of continuity

$\omega(\epsilon,\mathcal{F}) = \sup\{|T(g) - T(f)| : \|g - f\|_2 \le \epsilon;\ f \in \mathcal{F},\ g \in \mathcal{F}\}$
T. T. CAI AND M. G. LOW

where the norm in this equation is the $L_2$ norm in function space for the
white noise with drift model and the $\ell_2$ norm in sequence space for the
sequence model. Donoho and Liu (1991a, b) and Donoho (1994) have shown
that in either of these two cases,

(3) $R_A^*(n;\mathcal{F}) = \sup_{\epsilon>0} \omega^2(\epsilon,\mathcal{F})\,\dfrac{1/(4n)}{1/n + \epsilon^2/4}$



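and that

$\frac{1}{8}\,\omega^2\!\left(\frac{1}{\sqrt{n}},\mathcal{F}\right) \le R_A^*(n;\mathcal{F}) \le \omega^2\!\left(\frac{1}{\sqrt{n}},\mathcal{F}\right).$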



An earlier version of this result can also be found in Ibragimov and
Has'minskii (1984). Without the restriction to affine procedures write
$R_N^*(n;\mathcal{F})$ for the minimax mean squared error for estimating the
linear functional $T$. Donoho and Liu (1991b) have shown that

$\dfrac{R_A^*(n;\mathcal{F})}{R_N^*(n;\mathcal{F})} \le 1.25.$

Therefore the maximum risk of the optimal linear procedure is within a small 
constant factor of the minimax risk when the parameter space is convex. Of 
equal importance, Donoho and Liu (1991b) showed that the modulus can 
be used to give a recipe for constructing an affine procedure which has the 
maximum mean squared error attaining the bound given in (3). 
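
As a rough numerical illustration of the formula in (3), namely the supremum over $\epsilon > 0$ of $\omega^2(\epsilon,\mathcal{F})\,(1/(4n))/(1/n + \epsilon^2/4)$, the following sketch evaluates that supremum on a grid. It assumes a one-dimensional bounded normal mean problem, $\mathcal{F} = [-\tau, \tau]$ with $Tf = f$, for which $\omega(\epsilon,\mathcal{F}) = \min(\epsilon, 2\tau)$. The function name, the grid and the values of $\tau$ and $n$ are illustrative choices and are not taken from the paper.

```python
def affine_minimax_risk(omega, n, eps_grid):
    """Grid approximation of sup_eps omega(eps)^2 * (1/(4n)) / (1/n + eps^2/4)."""
    return max(
        omega(e) ** 2 * (1.0 / (4 * n)) / (1.0 / n + e ** 2 / 4.0)
        for e in eps_grid
    )


tau, n = 0.5, 100  # hypothetical bound on the mean and noise level 1/sqrt(n)


def omega(eps):
    # Modulus of T f = f over the interval [-tau, tau].
    return min(eps, 2 * tau)


# The grid contains eps = 2 * tau, where the supremum is attained.
eps_grid = [2 * tau * j / 1000 for j in range(1, 5001)]

risk = affine_minimax_risk(omega, n, eps_grid)
closed_form = tau ** 2 / (1 + n * tau ** 2)  # classical linear minimax risk
print(risk, closed_form)
```

In this example the grid value agrees with the familiar expression $\tau^2 \sigma^2/(\tau^2 + \sigma^2)$ with $\sigma^2 = 1/n$.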

Recent work on estimating linear functionals has focused on adaptive 
estimation. The goal is to find a single procedure which is near minimax 
simultaneously over a number of different parameter spaces. Pioneering work 
in this area began with Lepski (1990). This work focused on particularly 
important examples such as Lipschitz classes. In Efromovich and Low (1994) 
a general theory was developed for the case of nested convex parameter 
spaces. 

A general extension of this adaptive estimation theory to spaces which 
are not nested must also include a minimax analysis for sets which are not 
convex. The reason for this is that we need to first know the minimax risk 
over the union of the original convex spaces and this space need not be con- 
vex unless the sets are nested. This paper focuses on such an extension of 
the minimax theory for estimating linear functionals over nonconvex param- 
eter spaces. For applications to adaptive estimation see Cai and Low (2002). 
Although as just mentioned our primary motivation for this problem is the 
theory of adaptation the minimax theory itself is in fact quite interesting. 
In particular in this setting optimal linear procedures can sometimes have 
risks far from the optimal rate. In fact even if the parameter space is only 
a union of two convex sets it is possible that the maximum risk of the best 
linear estimator does not even converge even though the maximum risk of 
the optimal nonlinear procedure converges quickly. Such examples are given
in Section 5. 

Although optimal linear procedures need no longer be close to optimal 
we show that the minimax rate of convergence is still determined by the 
modulus of continuity over the parameter space when the parameter space 
is a finite union of convex sets. On the other hand, in Section 4, it is shown 
that the minimax linear risk is determined by the modulus of continuity 
over the convex hull of the parameter space. Therefore affine procedures 
fail when, in terms of the modulus, the convex hull is much larger than 
the parameter space itself. Such are the cases in the examples in Section 
5. In these cases rate optimal estimators need to be nonlinear. A general 
construction of such nonlinear procedures is given in Section 3. 

One of the main tools for the construction of the general procedure is a 
construction of linear procedures which have a given variance and precisely 
control the bias over two different convex parameter spaces. Upper bounds 
are given on the bias over one parameter space and lower bounds over the 
other. These linear procedures can then be used to test which of the convex 
sets the function lies in and then usual linear procedures can be used. The 
details of these arguments can be found in Sections 2 and 3. 

The theoretical results are complemented by several illustrative examples 
given in Section 5 covering a range of cases. In the examples of estimating a 
linear functional of a nearly black object the parameter space is the union 
of a growing number of convex parameter spaces. In these cases the usual 
minimax lower bound is no longer sharp and the minimax rate of convergence 
is derived explicitly using a mixture prior and a constrained risk inequality. 

2. Ordered modulus and bias variance tradeoffs. One of the main tools 
for the construction of the general minimax procedure is the construction 
of linear procedures which have a given variance and precisely control the 
bias over two different convex parameter spaces. Upper bounds are given on 
the bias over one parameter space and lower bounds over the other. The key 
technical tool which allows for this construction is an ordered modulus of 
continuity between two function spaces. It is a generalization of the modulus 
of continuity introduced by Donoho and Liu (1991a) which has already been 
shown in Low (1995) to allow for the construction of a procedure which 
minimizes the maximum squared bias given a constraint on the maximum 
variance. 

For a linear functional $T$ define an ordered modulus of continuity between
two classes $\omega(\epsilon,\mathcal{F},\mathcal{G})$ by

$\omega(\epsilon,\mathcal{F},\mathcal{G}) = \sup\{Tg - Tf : \|g - f\|_2 \le \epsilon;\ f \in \mathcal{F},\ g \in \mathcal{G}\}.$

Note that $\omega(\epsilon,\mathcal{F},\mathcal{G})$ does not necessarily equal
$\omega(\epsilon,\mathcal{G},\mathcal{F})$. It is clear that the modulus
$\omega(\epsilon,\mathcal{F},\mathcal{G})$ is an increasing function of $\epsilon$ and
$0 \le \omega(\epsilon,\mathcal{F},\mathcal{G}) \le \infty$ if $\mathcal{F} \cap \mathcal{G} \ne \emptyset$.
The between class modulus is also instrumental in the analysis
of adaptation over different parameter spaces [see Cai and Low (2002)].
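
To make the asymmetry of the ordered modulus concrete, here is a minimal sketch for a toy case that is not in the paper: both classes are closed intervals on the real line and $Tf = f$. The differences $g - f$ then range over the interval $[a_G - b_F,\ b_G - a_F]$, so the supremum can be written down directly. The function name and the two intervals are hypothetical.

```python
import math


def ordered_modulus(eps, F, G):
    """sup{g - f : |g - f| <= eps, f in F, g in G} for intervals F, G and T f = f."""
    (a_f, b_f), (a_g, b_g) = F, G
    lo = max(-eps, a_g - b_f)  # smallest feasible difference g - f
    hi = min(eps, b_g - a_f)   # largest feasible difference g - f
    return hi if lo <= hi else -math.inf  # empty constraint set gives -inf


F, G = (0.0, 1.0), (0.5, 3.0)  # two overlapping intervals
for eps in (0.25, 1.0, 4.0):
    print(eps, ordered_modulus(eps, F, G), ordered_modulus(eps, G, F))
```

For these intervals the modulus from $\mathcal{F}$ to $\mathcal{G}$ is $\min(\epsilon, 3)$ while the modulus from $\mathcal{G}$ to $\mathcal{F}$ is $\min(\epsilon, 0.5)$, so the order of the two classes matters.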

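When $\mathcal{G} = \mathcal{F}$, $\omega(\epsilon,\mathcal{F},\mathcal{F})$ is the usual modulus of
continuity over $\mathcal{F}$ and will be denoted by $\omega(\epsilon,\mathcal{F})$. The
following result on the concavity of the modulus is important in the bias
variance tradeoffs and in the construction of the general minimax procedure.

Theorem 1. Assume that $\mathcal{F}$, $\mathcal{G}$ are convex and that
$\mathcal{F} \cap \mathcal{G} \ne \emptyset$. Let $T$ be a linear functional. Then the function
$\omega(\epsilon,\mathcal{F},\mathcal{G})$ is a concave function of $\epsilon$. In particular it
follows that, for $D \ge 1$,

$\omega(D\epsilon,\mathcal{F},\mathcal{G}) \le D\,\omega(\epsilon,\mathcal{F},\mathcal{G}).$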

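Proof. Suppose that $g_1 \in \mathcal{G}$, $g_2 \in \mathcal{G}$ and $f_1 \in \mathcal{F}$, $f_2 \in \mathcal{F}$ with

$\|g_i - f_i\|_2 \le \epsilon_i, \quad i = 1, 2.$

Then, for $0 \le \lambda \le 1$,

$\|\lambda g_2 + (1-\lambda) g_1 - [\lambda f_2 + (1-\lambda) f_1]\|_2 \le \lambda \epsilon_2 + (1-\lambda)\epsilon_1$

and

$T(\lambda g_2 + (1-\lambda) g_1 - [\lambda f_2 + (1-\lambda) f_1]) = \lambda (T(g_2) - T(f_2)) + (1-\lambda)(T(g_1) - T(f_1)).$

It then follows that

$\omega(\lambda \epsilon_2 + (1-\lambda)\epsilon_1,\mathcal{F},\mathcal{G}) \ge \lambda\,\omega(\epsilon_2,\mathcal{F},\mathcal{G}) + (1-\lambda)\,\omega(\epsilon_1,\mathcal{F},\mathcal{G})$

and so $\omega$ is concave. □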
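
The two conclusions of Theorem 1 can be checked numerically on simple examples. The sketch below tests midpoint concavity and the scaling inequality $\omega(D\epsilon) \le D\,\omega(\epsilon)$ on a grid, for two stand-in moduli: $\min(\epsilon, 3)$, which arises for intervals on the real line, and the power law $\epsilon^{2/3}$. Both choices, and the helper names, are illustrative assumptions rather than objects defined in the paper.

```python
def is_midpoint_concave(omega, grid, tol=1e-12):
    """Check omega((s + t) / 2) >= (omega(s) + omega(t)) / 2 on all grid pairs."""
    return all(
        omega((s + t) / 2) + tol >= (omega(s) + omega(t)) / 2
        for s in grid
        for t in grid
    )


def satisfies_scaling(omega, grid, factors, tol=1e-12):
    """Check omega(D * eps) <= D * omega(eps) for every D >= 1 in `factors`."""
    return all(omega(D * e) <= D * omega(e) + tol for e in grid for D in factors)


moduli = {
    "interval": lambda e: min(e, 3.0),      # bounded one-dimensional example
    "power": lambda e: e ** (2.0 / 3.0),    # hypothetical power-law modulus
}
grid = [0.05 * j for j in range(1, 101)]
factors = [1.0, 1.5, 2.0, 10.0]

results = {
    name: (is_midpoint_concave(w, grid), satisfies_scaling(w, grid, factors))
    for name, w in moduli.items()
}
print(results)
```

A grid check of this kind is of course no proof; it only illustrates what the theorem asserts.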

As mentioned earlier in Low (1995) it was shown that in the white noise 
model for any linear functional the modulus of continuity can be used to 
precisely trade off various levels of bias and variance over a given convex 
parameter space. The modulus of continuity between parameter spaces can 
be used to perform an analogous trade. It can be used to give a linear 
procedure which has upper bounds for the bias over one parameter space 
and lower bounds for the bias over the other parameter space. The detailed 
results are given in Theorems 2 and 3 below. 

We shall write $\langle u, v \rangle$ for the usual $L_2$ inner product for either
sequence or function space. Specifically if we observe the white noise with
drift model let

$\langle f, g \rangle = \int f g$

and if we observe the sequence model let

$\langle f, g \rangle = \sum_i f_i g_i.$

For all $V > 0$ let

(4) $B(V,\mathcal{F},\mathcal{G}) = 2^{-1} \sup_{\epsilon > 0} \left(\omega(\epsilon,\mathcal{F},\mathcal{G}) - \sqrt{nV}\,\epsilon\right).$

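It will also be convenient to introduce an inverse of $B(V,\mathcal{F},\mathcal{G})$ defined for
all $B > 0$ by

(5) $V(B,\mathcal{F},\mathcal{G}) = \sup_{\epsilon > 0} \dfrac{1}{n\epsilon^2}\left\{[\omega(\epsilon,\mathcal{F},\mathcal{G}) - 2B]_+\right\}^2.$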

We shall show in Theorems 2 and 3 that there is a linear estimator with
variance bounded by $V$, which has maximum bias over $\mathcal{F}$ less than or equal
to $B(V,\mathcal{F},\mathcal{G})$ and minimum bias over $\mathcal{G}$ greater than or equal to
$-B(V,\mathcal{F},\mathcal{G})$. Theorem 2 covers the most usual situations where linear
estimators can be easily described in terms of the modulus. Theorem 3 extends
the theory to cover the general case.
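
The tradeoff between the variance level $V$ and the bias bound can be explored numerically. The sketch below approximates $B(V) = 2^{-1}\sup_{\epsilon>0}(\omega(\epsilon) - \sqrt{nV}\,\epsilon)$ on a grid and inverts it with $\sup_{\epsilon>0} \{[\omega(\epsilon) - 2B]_+/\epsilon\}^2/n$, which is the expression obtained by solving the bias bound for $V$; treating that expression as the inverse is an assumption of this sketch. The modulus $\min(\epsilon, 2\tau)$ of a bounded one-dimensional mean, the function names and the numerical values are likewise illustrative.

```python
import math


def bias_bound(V, omega, n, eps_grid):
    """Grid version of (1/2) * sup_eps (omega(eps) - sqrt(n V) * eps)."""
    best = max(omega(e) - math.sqrt(n * V) * e for e in eps_grid)
    return 0.5 * max(best, 0.0)  # the supremum is at least the limit as eps -> 0


def variance_bound(B, omega, n, eps_grid):
    """Grid version of sup_eps (max(omega(eps) - 2B, 0) / eps)^2 / n."""
    return max((max(omega(e) - 2 * B, 0.0) / e) ** 2 for e in eps_grid) / n


tau, n = 0.5, 100


def omega(eps):
    return min(eps, 2 * tau)


eps_grid = [2 * tau * j / 1000 for j in range(1, 5001)]

V = 0.25 / n                                   # so that sqrt(n V) = 1/2
B = bias_bound(V, omega, n, eps_grid)          # equals tau * (1 - sqrt(n V))
V_back = variance_bound(B, omega, n, eps_grid)
print(B, V, V_back)
```

In this example halving the standard deviation relative to the unbiased estimator costs a worst-case bias of $\tau/2$, and mapping that bias back recovers the original variance level.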

Our analysis is split into a number of cases. The most usual ones are
covered by cases 1(a) and 2(a). It is these cases which are in fact needed in
the construction of the general procedure in Section 3. We include the others
for completeness. First note that we shall always assume that
$\omega(1,\mathcal{F},\mathcal{G}) > 0$; otherwise the linear functional is constant over
$\mathcal{F} \cup \mathcal{G}$ and the estimation problem is thus trivial.

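Case 1. Suppose that $0 < B(V,\mathcal{F},\mathcal{G}) < \infty$. Then define $\epsilon(V,\mathcal{F},\mathcal{G})$ by

(6) $\epsilon(V,\mathcal{F},\mathcal{G}) = \arg\max_{\epsilon > 0} \left(\omega(\epsilon,\mathcal{F},\mathcal{G}) - \sqrt{nV}\,\epsilon\right)$

where $\epsilon(V,\mathcal{F},\mathcal{G})$ is the smallest value of $\epsilon$ for which the maximum in (4)
is attained. It will be convenient to break case 1 into two further cases,
namely:

(a) $0 < \epsilon(V,\mathcal{F},\mathcal{G}) < \infty$.

(b) $\epsilon(V,\mathcal{F},\mathcal{G}) = \infty$.

Case 2. $B(V,\mathcal{F},\mathcal{G}) = 0$ and $B(V',\mathcal{F},\mathcal{G}) > 0$ for all $0 < V' < V$. Note
that if $B(V',\mathcal{F},\mathcal{G}) = 0$ for some $V' < V$ then we could reduce the variance
of our estimator without increasing the magnitude of the bias. Under this
assumption there are only two possibilities.

(a) $\omega(\epsilon,\mathcal{F},\mathcal{G}) = \sqrt{nV}\,\epsilon$ on some interval $0 \le \epsilon \le \epsilon_0$ where
$\epsilon_0 > 0$. We can then define $\epsilon(V,\mathcal{F},\mathcal{G})$ to be the largest
$\epsilon \le 1/\sqrt{n}$ for which $\omega(\epsilon,\mathcal{F},\mathcal{G}) = \sqrt{nV}\,\epsilon$.

(b) $\omega(\epsilon,\mathcal{F},\mathcal{G}) < \sqrt{nV}\,\epsilon$ whenever $\epsilon > 0$. It then follows from the
concavity of the modulus that $0 < B(V',\mathcal{F},\mathcal{G}) < \infty$ for some $V' < V$. In
this case set $\epsilon(V,\mathcal{F},\mathcal{G}) = 0$.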

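The following technical lemma shows that $B(V,\mathcal{F},\mathcal{G})$ is continuous in $V$
whenever it is finite.

Lemma 1. Suppose $\mathcal{F}$ and $\mathcal{G}$ are closed and convex with
$\mathcal{F} \cap \mathcal{G} \ne \emptyset$. Then $\epsilon(V,\mathcal{F},\mathcal{G})$ is nonincreasing in $V$. Assume
$B(V,\mathcal{F},\mathcal{G}) < \infty$. Then $\epsilon(V',\mathcal{F},\mathcal{G}) < \infty$ if $V' > V$ and

(7) $\lim_{V_m \downarrow V} B(V_m,\mathcal{F},\mathcal{G}) = B(V,\mathcal{F},\mathcal{G}).$

If, in addition, $B(V',\mathcal{F},\mathcal{G}) < \infty$ for some $V' < V$, then

(8) $\lim_{V_m \uparrow V} B(V_m,\mathcal{F},\mathcal{G}) = B(V,\mathcal{F},\mathcal{G}).$

Proof. Note first that the monotonicity of $\epsilon(V,\mathcal{F},\mathcal{G})$ and the fact that
$\epsilon(V',\mathcal{F},\mathcal{G}) < \infty$ if $V' > V$ follow from the concavity of the modulus
$\omega(\epsilon,\mathcal{F},\mathcal{G})$ as shown in Theorem 1. Now assume that
$B(V,\mathcal{F},\mathcal{G}) < \infty$ and let $V_m \downarrow V$. Note that

$B(V_m,\mathcal{F},\mathcal{G}) \le B(V,\mathcal{F},\mathcal{G})$ for $V_m \ge V$,

and that for any $\epsilon$

$B(V_m,\mathcal{F},\mathcal{G}) \ge 2^{-1}\left(\omega(\epsilon,\mathcal{F},\mathcal{G}) - \sqrt{nV_m}\,\epsilon\right).$

Taking limits yields

$\liminf_{m} B(V_m,\mathcal{F},\mathcal{G}) \ge 2^{-1}\left(\omega(\epsilon,\mathcal{F},\mathcal{G}) - \sqrt{nV}\,\epsilon\right)$

for all $\epsilon$ and so taking the supremum over all $\epsilon$ on the right-hand side shows
that the limit exists and is equal to $B(V,\mathcal{F},\mathcal{G})$. This proves (7).

Note that $B(V,\mathcal{F},\mathcal{G})$ is a convex function of $\sqrt{V}$ since it is a supremum
of a collection of convex functions of $\sqrt{V}$. Hence $B(V,\mathcal{F},\mathcal{G})$ is continuous in
$V$ on any open interval over which it is finite. Hence if $B(V',\mathcal{F},\mathcal{G}) < \infty$ for
some $V' < V$ then $B(\cdot,\mathcal{F},\mathcal{G})$ is continuous at $V$ and so (8) follows. □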

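We now state the bias-variance tradeoff theorem in the most easily
understood and most typical case where $0 < \epsilon(V,\mathcal{F},\mathcal{G}) < \infty$ and the modulus
is attained by two functions $f \in \mathcal{F}$ and $g \in \mathcal{G}$.

Theorem 2. Suppose $\mathcal{F}$ and $\mathcal{G}$ are convex and closed with
$\mathcal{F} \cap \mathcal{G} \ne \emptyset$. Assume that $0 < \epsilon(V,\mathcal{F},\mathcal{G}) < \infty$. Suppose further
that there are $f \in \mathcal{F}$, $g \in \mathcal{G}$ such that

(9) $\|g - f\|_2 = \epsilon(V,\mathcal{F},\mathcal{G}) = \epsilon_V$ and $Tg - Tf = \omega(\epsilon_V,\mathcal{F},\mathcal{G}).$

Write $u = (g - f)/\epsilon_V$ for the direction of the affine family joining $g$ and $f$. Let

(10) $a = T\!\left(\dfrac{f+g}{2}\right) - \sqrt{nV}\left\langle u, \dfrac{f+g}{2}\right\rangle.$

Then the estimator

(11) $\hat{T}_V = a + \sqrt{nV}\int u(t)\,dY(t)$

for the white noise with drift model and the estimator

(12) $\hat{T}_V = a + \sqrt{nV}\sum_i u(i)\,Y(i)$

for the sequence model have constant variance

(13) $E(\hat{T}_V - E\hat{T}_V)^2 = V$

and have biases bounded by

(14) $\sup_{f \in \mathcal{F}} E_f\hat{T}_V - Tf = B(V,\mathcal{F},\mathcal{G})$

and

(15) $\inf_{g \in \mathcal{G}} E_g\hat{T}_V - Tg = -B(V,\mathcal{F},\mathcal{G}).$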

Remark. If $\mathcal{F}$ and $\mathcal{G}$ are closed, convex and norm bounded with nonempty
intersection then the condition that the modulus is attained is guaranteed.
The extension to cases where the modulus is not attained, as well as to the
cases $\epsilon(V,\mathcal{F},\mathcal{G}) = 0$ and $\epsilon(V,\mathcal{F},\mathcal{G}) = \infty$, will be covered in Theorem 3.
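
The construction in Theorem 2 can be traced by hand in a toy setting, sketched below. It assumes a single observation $Y \sim N(\theta, 1/n)$, the functional $T\theta = \theta$, and two overlapping intervals in the roles of $\mathcal{F}$ and $\mathcal{G}$, so that the ordered modulus is $\min(\epsilon, L)$ with $L = b_G - a_F$ and the extremal pair is $f = a_F$, $g = b_G$ whenever $\sqrt{nV} < 1$. None of these specifics come from the paper, and the function names are hypothetical.

```python
import math


def theorem2_estimator(F, G, V, n):
    """Return (a, c) with T_hat(Y) = a + c * Y for intervals F, G and T theta = theta."""
    c = math.sqrt(n * V)
    assert 0 < c < 1, "this sketch only covers 0 < sqrt(nV) < 1"
    f, g = F[0], G[1]        # extremal pair attaining the modulus at eps_V
    eps_v = g - f            # eps_V = b_G - a_F in this toy case
    u = (g - f) / eps_v      # direction joining f and g (equals 1 here)
    mid = (f + g) / 2
    a = mid - c * u * mid    # a = T((f+g)/2) - sqrt(nV) <u, (f+g)/2>
    return a, c


def bias(a, c, theta):
    """E_theta[a + c * Y] - theta."""
    return a + c * theta - theta


def grid(interval, steps=1000):
    lo, hi = interval
    return [lo + (hi - lo) * j / steps for j in range(steps + 1)]


F, G, n = (0.0, 1.0), (0.5, 3.0), 100
V = 0.25 / n
a, c = theorem2_estimator(F, G, V, n)

variance = c ** 2 / n                      # variance of a + c * Y
B = 0.5 * (G[1] - F[0]) * (1 - c)          # (1/2)(omega(eps_V) - sqrt(nV) eps_V)
max_bias_F = max(bias(a, c, t) for t in grid(F))
min_bias_G = min(bias(a, c, t) for t in grid(G))
print(variance, B, max_bias_F, min_bias_G)
```

Here the estimator has variance exactly $V$, its largest bias over the first interval is $B$ and its smallest bias over the second interval is $-B$, in line with (13) to (15).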

Proof of Theorem 2. The proof of this theorem essentially follows
that of Theorem 2 in Low (1995). Note that the proofs of (14) and (15) are
entirely similar so we shall only give the details for the proof of (15).

Let $f \in \mathcal{F}$ and $g \in \mathcal{G}$ be extremal functions satisfying (9) which exist since
$\mathcal{F}$ and $\mathcal{G}$ are closed. Let $h$ be any other element of $\mathcal{G}$. The affine family joining
$g$ and $h$ is given by $(1-\theta)g + \theta h$, $0 \le \theta \le 1$. Let

$J(\theta) = T((1-\theta)g + \theta h) - Tf - \sqrt{nV}\,\|(1-\theta)g + \theta h - f\|_2.$

It follows from the definition of $\epsilon(V,\mathcal{F},\mathcal{G})$ given in (6) that $J(\theta) \le J(0)$ for
all $0 \le \theta \le 1$ and since $J(\theta)$ is clearly differentiable it follows that $J'(0) \le 0$.
A simple computation shows that

(16) $Th - Tg - \sqrt{nV}\,\langle u, h - g\rangle \le 0.$

Now

(17) $E_g\hat{T}_V - Tg = T\!\left(\dfrac{f+g}{2}\right) + \sqrt{nV}\left\langle u, g - \dfrac{f+g}{2}\right\rangle - Tg$

and

(18) $E_h\hat{T}_V - Th = T\!\left(\dfrac{f+g}{2}\right) + \sqrt{nV}\left\langle u, h - \dfrac{f+g}{2}\right\rangle - Th.$

It then follows from (16)-(18) that

(19) $(E_g\hat{T}_V - Tg) - (E_h\hat{T}_V - Th) \le 0.$

Finally note that a simple calculation yields

(20) $E_g\hat{T}_V - Tg = -B(V,\mathcal{F},\mathcal{G}).$

Equations (19) and (20) combine to show (15) and the proof is complete. □

Theorem 2 treats the cases 1(a) and 2(a) under the additional assumption
that the modulus is attained by $f \in \mathcal{F}$ and $g \in \mathcal{G}$. The functions $f$ and $g$
are used explicitly in the construction of the estimate $\hat{T}_V$. In general, the
modulus may not be attained and in these cases the description of a linear
estimator which trades variance and bias is more involved. We describe the
general case in detail in the following theorem. Some of the details are similar
to those given in Section 12 of Donoho (1994).

Define $B(m)$ to be the closed $L_2$ ball with radius $m$ and let
$\mathcal{F}_m = \mathcal{F} \cap B(m)$ and $\mathcal{G}_m = \mathcal{G} \cap B(m)$. It follows from Lemma 2 of Donoho
(1994) that for $\mathcal{F}_m$ and $\mathcal{G}_m$ the modulus $\omega(\epsilon,\mathcal{F}_m,\mathcal{G}_m)$ can always be
attained by some $f \in \mathcal{F}_m$ and $g \in \mathcal{G}_m$.

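Define $V_m$, $\epsilon_m$, $f_m$ and $g_m$ in the following way.

Case 1.

(a) $0 < B(V,\mathcal{F},\mathcal{G}) < \infty$ and $0 < \epsilon(V,\mathcal{F},\mathcal{G}) < \infty$. In this case let $V_m = V$,
$l(m) = m$ and define $\epsilon_m = \epsilon(V_m,\mathcal{F}_{l(m)},\mathcal{G}_{l(m)})$. Note that for large $m$,
$\epsilon_m > 0$. Moreover, since both $\mathcal{F}_m$ and $\mathcal{G}_m$ are contained in $B(m)$ it follows
that $\epsilon_m \le 2m$. Since $\mathcal{F}_{l(m)}$ and $\mathcal{G}_{l(m)}$ are closed and norm bounded it follows
from Lemma 2 of Donoho (1994) that the modulus $\omega(\epsilon_m,\mathcal{F}_{l(m)},\mathcal{G}_{l(m)})$ is
attained by a pair $f_m \in \mathcal{F}_{l(m)}$ and $g_m \in \mathcal{G}_{l(m)}$.

(b) $\epsilon(V,\mathcal{F},\mathcal{G}) = \infty$. In this case let $V_m > V$ be chosen where $V_m \downarrow V$.
Then it follows from Lemma 1 that $B(V_m,\mathcal{F},\mathcal{G}) \to B(V,\mathcal{F},\mathcal{G})$. So for large $m$,
$0 < B(V_m,\mathcal{F},\mathcal{G}) < \infty$. Now choose an increasing sequence $l(m) \to \infty$ so
that $B(V_m,\mathcal{F}_{l(m)},\mathcal{G}_{l(m)}) > 0$. Now define $\epsilon_m = \epsilon(V_m,\mathcal{F}_{l(m)},\mathcal{G}_{l(m)})$ and once
again note that for large $m$, $0 < \epsilon_m \le 2m$. Again $\mathcal{F}_{l(m)}$ and $\mathcal{G}_{l(m)}$ are closed
and norm bounded so the modulus $\omega(\epsilon_m,\mathcal{F}_{l(m)},\mathcal{G}_{l(m)})$ is attained by a pair
$f_m \in \mathcal{F}_{l(m)}$ and $g_m \in \mathcal{G}_{l(m)}$.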

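Case 2.

(a) $B(V,\mathcal{F},\mathcal{G}) = 0$, $B(V',\mathcal{F},\mathcal{G}) > 0$ for all $V' < V$ and
$\omega(\epsilon,\mathcal{F},\mathcal{G}) = \sqrt{nV}\,\epsilon$ on some interval $0 \le \epsilon \le \epsilon_0$ for some $\epsilon_0 > 0$.
Let $l(m) = m$ and note that at least for $m$ sufficiently large
$0 < B(0,\mathcal{F}_{l(m)},\mathcal{G}_{l(m)}) < \infty$ and that since $\mathcal{F}_{l(m)} \subseteq \mathcal{F}$ and
$\mathcal{G}_{l(m)} \subseteq \mathcal{G}$ it also follows that $B(V,\mathcal{F}_{l(m)},\mathcal{G}_{l(m)}) = 0$. Lemma
1 shows that for all sufficiently large $m$ there exists a $V_m < V$ such that

$0 < B(V_m,\mathcal{F}_{l(m)},\mathcal{G}_{l(m)}) \le \dfrac{1}{m}.$

Now let $\epsilon_m = \epsilon(V_m,\mathcal{F}_{l(m)},\mathcal{G}_{l(m)})$. Then as before it follows that for large $m$,
$0 < \epsilon_m \le 2m$. Now since $\mathcal{F}_{l(m)}$ and $\mathcal{G}_{l(m)}$ are closed and norm bounded, the
modulus $\omega(\epsilon_m,\mathcal{F}_{l(m)},\mathcal{G}_{l(m)})$ is attained by a pair $f_m \in \mathcal{F}_{l(m)}$ and
$g_m \in \mathcal{G}_{l(m)}$.

(b) $B(V,\mathcal{F},\mathcal{G}) = 0$, $B(V',\mathcal{F},\mathcal{G}) > 0$ for all $0 < V' < V$,
$\omega(\epsilon,\mathcal{F},\mathcal{G}) < \sqrt{nV}\,\epsilon$ whenever $\epsilon > 0$.

Now let $V_m < V$ be chosen where $V_m \uparrow V$. Note that there exists some
$V_0 > 0$ such that

$0 < B(V',\mathcal{F},\mathcal{G}) < \infty$

for $V_0 < V' < V$. Then for large $m$,

$0 < B(V_m,\mathcal{F},\mathcal{G}) < \infty.$

So there is an increasing sequence $l(m) \to \infty$ such that

$0 < B(V_m,\mathcal{F}_{l(m)},\mathcal{G}_{l(m)}) \le B(V_m,\mathcal{F},\mathcal{G}) < \infty.$

We now define $\epsilon_m = \epsilon(V_m,\mathcal{F}_{l(m)},\mathcal{G}_{l(m)})$. It follows once again that
$0 < \epsilon_m \le 2m$ for large $m$. Now since $\mathcal{F}_{l(m)}$ and $\mathcal{G}_{l(m)}$ are closed and norm
bounded the modulus $\omega(\epsilon_m,\mathcal{F}_{l(m)},\mathcal{G}_{l(m)})$ is attained by a pair
$f_m \in \mathcal{F}_{l(m)}$ and $g_m \in \mathcal{G}_{l(m)}$.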

For $V_m$, $\epsilon_m$, $f_m$ and $g_m$ as just defined let $u_m = (g_m - f_m)/\epsilon_m$ and let

$a_m = T\!\left(\dfrac{f_m + g_m}{2}\right) - \sqrt{nV_m}\left\langle u_m, \dfrac{f_m + g_m}{2}\right\rangle.$

For the white noise with drift model let

(21) $\hat{T}_m = a_m + \sqrt{nV_m}\int u_m(t)\,dY(t)$

and for the sequence model let

(22) $\hat{T}_m = a_m + \sqrt{nV_m}\sum_i u_m(i)\,Y(i).$

The estimator $\hat{T}_m$ corresponds to the estimator $\hat{T}_V$ defined in Theorem 2
for $V = V_m$, $\mathcal{F} = \mathcal{F}_{l(m)}$ and $\mathcal{G} = \mathcal{G}_{l(m)}$. In the general case we need to take a
limit of the estimators $\hat{T}_m$.
Note that $\|u_m\|_2 = 1$ and so there exists a subsequence which converges
weakly to some function $u$ where $\|u\|_2 \le 1$.

Now let $h \in \mathcal{F} \cap \mathcal{G}$. Then since $\|h\|_2 \le m_0$ for some $m_0 < \infty$ it follows that
$h \in \mathcal{F}_{l(m)} \cap \mathcal{G}_{l(m)}$ for all $m \ge m_1$ where $l(m_1) \ge m_0$.

For $m \ge m_1$ note it follows from Theorem 2 that

$|E_h\hat{T}_m - Th| \le B(V_m,\mathcal{F}_{l(m)},\mathcal{G}_{l(m)}) \le \dfrac{1}{m}$

in case 2(a), and in all other cases

$|E_h\hat{T}_m - Th| \le B(V_m,\mathcal{F}_{l(m)},\mathcal{G}_{l(m)}) \le B(V_m,\mathcal{F},\mathcal{G}).$

Note that $B(V_m,\mathcal{F},\mathcal{G})$ is bounded since it converges to $B(V,\mathcal{F},\mathcal{G})$. Note also
that

$E_h\hat{T}_m = a_m + \sqrt{nV_m}\,\langle u_m, h\rangle$

and since the norm of $u_m$ is equal to one it follows that $a_m$ is bounded. Hence
there is a subsequence of the subsequence used to define $u$ which converges
to some finite $a$. Denote this subsubsequence by $m_k$.
For the white noise with drift model let

(23) $\hat{T}_V = a + \sqrt{nV}\int u(t)\,dY(t)$

and for the sequence model let

(24) $\hat{T}_V = a + \sqrt{nV}\sum_i u(i)\,Y(i).$

The following theorem shows that this estimator $\hat{T}_V$ which has been formed
as a limit of $\hat{T}_m$ trades bias and variance in the general case.

Theorem 3. Suppose $\mathcal{F}$ and $\mathcal{G}$ are convex and closed with nonempty
intersection. Then the estimator defined by (23) for the white noise with
drift model and (24) for the sequence model satisfies

(25) $E(\hat{T}_V - E\hat{T}_V)^2 \le V$

and has biases bounded by

(26) $\sup_{f \in \mathcal{F}} E_f\hat{T}_V - Tf \le B(V,\mathcal{F},\mathcal{G})$

and

(27) $\inf_{g \in \mathcal{G}} E_g\hat{T}_V - Tg \ge -B(V,\mathcal{F},\mathcal{G}).$

Proof. Note that (25) follows immediately from the fact that the norm
of $u$ is bounded by 1. We shall only give the proof for (26) since the proofs
for the other cases are analogous.

First note that the estimator $\hat{T}_m$ as defined in (21) and (22) satisfies the
bounds given in Theorem 2. If $f \in \mathcal{F}_{l(m)}$ then

$E_f\hat{T}_m - Tf \le B(V_m,\mathcal{F}_{l(m)},\mathcal{G}_{l(m)}).$

Let $m_k$ be the subsubsequence along which $a_m$ and $u_m$ converge to $a$ and
$u$, respectively. Now for any $f \in \mathcal{F}$, $f \in \mathcal{F}_{l(m_k)}$ for large $k$. So

$E_f\hat{T}_V - Tf \le \limsup_{k \to \infty} E_f\hat{T}_{m_k} - Tf$

$\le \limsup_{k \to \infty} B(V_{m_k},\mathcal{F},\mathcal{G})$

$\le B(V,\mathcal{F},\mathcal{G}).$

The last step follows from Lemma 1. □

Remark. Using the Cramér-Rao inequality arguments found in Low
(1995) it can be shown that the linear estimator which attains the bounds
in the theorem is in fact unique and must actually attain the inequalities.
It then follows that the sequence $u_m$ which was used to define the estimator
$\hat{T}_V$ actually converges strongly to $u$ and that the sequence $a_m$ actually
converges.

3. Minimax estimator over a finite union of convex sets. Let
$\mathcal{F} = \bigcup_{i=1}^{k} \mathcal{F}_i$ where for $i = 1, \ldots, k$, $\mathcal{F}_i$ are closed convex spaces
with nonempty intersections, that is, $\mathcal{F}_i \cap \mathcal{F}_j \ne \emptyset$ for all $i, j$. Our
objective is to construct an estimator which is rate optimal for estimating a
linear functional $T$ over the parameter space $\mathcal{F}$. Standard two-point testing
arguments as, for example, contained in Donoho and Liu (1991a) or Brown and Low
(1996) show that the minimax risk for estimating a linear functional $Tf$ over
$\mathcal{F}$ is bounded from below by

(28) $\inf_{\hat{T}} \sup_{f \in \mathcal{F}} E(\hat{T} - Tf)^2 \ge \dfrac{1}{8}\,\omega^2\!\left(\dfrac{1}{\sqrt{n}},\mathcal{F}\right).$

Let $\hat{T}_i$ be linear estimators which satisfy

(29) $\sup_{f \in \mathcal{F}_i} E(\hat{T}_i - Tf)^2 \le M^2\,\omega^2\!\left(\dfrac{1}{\sqrt{n}},\mathcal{F}_i\right)$

for some $M > 0$. As mentioned in the introduction if $M \ge 1$ such linear
estimators are guaranteed to exist and can be constructed by the recipe
given in Donoho (1994). In the following discussion $C$ will denote generic
constants whereas $M$ will always refer to the bounds given in (29).

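For $i \ne j$, let $V_{i,j} = \omega^2(\frac{1}{\sqrt{n}},\mathcal{F}_i,\mathcal{F}_j)$. Then it follows from the
concavity of the modulus that $B(V_{i,j},\mathcal{F}_i,\mathcal{F}_j)$ as defined by (4) satisfies

$B(V_{i,j},\mathcal{F}_i,\mathcal{F}_j) = 2^{-1}\sup_{\epsilon > 0}\left[\omega(\epsilon,\mathcal{F}_i,\mathcal{F}_j) - \sqrt{n}\,\epsilon\,\omega\!\left(\dfrac{1}{\sqrt{n}},\mathcal{F}_i,\mathcal{F}_j\right)\right] \le 2^{-1}\omega\!\left(\dfrac{1}{\sqrt{n}},\mathcal{F}_i,\mathcal{F}_j\right).$

Hence either $B(V_{i,j},\mathcal{F}_i,\mathcal{F}_j) = 0$ or
$0 < B(V_{i,j},\mathcal{F}_i,\mathcal{F}_j) \le 2^{-1}\omega(\frac{1}{\sqrt{n}},\mathcal{F}_i,\mathcal{F}_j)$.

In the first case when $B(V_{i,j},\mathcal{F}_i,\mathcal{F}_j) = 0$ it follows from the definition of
$\epsilon(V_{i,j},\mathcal{F}_i,\mathcal{F}_j)$ given for case 2(a) that $\epsilon(V_{i,j},\mathcal{F}_i,\mathcal{F}_j) = \frac{1}{\sqrt{n}}$. On the
other hand if $0 < B(V_{i,j},\mathcal{F}_i,\mathcal{F}_j) \le 2^{-1}\omega(\frac{1}{\sqrt{n}},\mathcal{F}_i,\mathcal{F}_j)$ then
$0 < \epsilon(V_{i,j},\mathcal{F}_i,\mathcal{F}_j) \le \frac{1}{\sqrt{n}}$. Hence we know in both cases that
$0 < \epsilon(V_{i,j},\mathcal{F}_i,\mathcal{F}_j) \le \frac{1}{\sqrt{n}}$. It follows that when using
$V_{i,j} = \omega^2(\frac{1}{\sqrt{n}},\mathcal{F}_i,\mathcal{F}_j)$ we are in either case 1(a) or case 2(a) of
Section 2.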

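For $i \ne j$ let $\hat{T}_{i,j}$ be the estimator defined as in Theorem 2 when (9) is
attained where $\mathcal{F} = \mathcal{F}_i$, $\mathcal{G} = \mathcal{F}_j$ and
$V = V_{i,j} = \omega^2(\frac{1}{\sqrt{n}},\mathcal{F}_i,\mathcal{F}_j)$. When (9) is not attained the estimator
$\hat{T}_{i,j}$ is defined as in Theorem 3. This linear estimator has variance bounded by
$\omega^2(\frac{1}{\sqrt{n}},\mathcal{F}_i,\mathcal{F}_j)$ and bias which satisfies

(30) $-2^{-1}\omega\!\left(\dfrac{1}{\sqrt{n}},\mathcal{F}_i,\mathcal{F}_j\right) \le \inf_{f \in \mathcal{F}_j}\left(E(\hat{T}_{i,j}) - Tf\right)$

and

(31) $\sup_{f \in \mathcal{F}_i}\left(E(\hat{T}_{i,j}) - Tf\right) \le 2^{-1}\omega\!\left(\dfrac{1}{\sqrt{n}},\mathcal{F}_i,\mathcal{F}_j\right).$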

Now based on the linear estimators Tjj and the linear estimators Tj, which 
satisfy (29), define zf,, z\a and Zij by 

T- • — T- 



u:{^,Ti,Tj) + Mu:{^,T,y 






and 
Note that zfj and z' • are normally distributed and satisfy

(32) max{YaT{zlj),Yai{zij)) < 1. 

Finally define the estimator of the linear functional $T$ as

(33) $\hat{T}^* = \hat{T}_{\hat{i}}$ with $\hat{i} = \arg\min_i \sup_j z_{i,j}.$

The analysis of the mean squared error of $\hat{T}^*$ is facilitated by the following
lemma which bounds the probability that $\hat{T}^* = \hat{T}_j$ when the magnitude of
the bias of $\hat{T}_j$ is large.

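Lemma 2. Suppose $f \in \mathcal{F}_i$ and for some $j \ne i$,
$|E\hat{T}_j - Tf| \ge \gamma M \omega(\frac{1}{\sqrt{n}},\mathcal{F})$ where $\gamma > 3$. Then

$P(\hat{i} = j) \le 2k \exp\!\left(-\dfrac{(\gamma - 3)^2}{32}\right).$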
Proof. Note that if / G J^j and j 7^ i, then from (29) and (31), 



(34) 



^^«_ E{fi^j-Tf)-E{f,-Tf) ^ ^ 



a.(^,.F„.F,)+Ma;(^,J-0 
and from (29) and (30), 

(35) ^,j i^(i.-^/)-g(%-r/) , 

Now suppose that f £ J^i. We shall only give details of the proof when 

1 



as the case when 



ETj-Tf>jMu[—,T 



ETj-Tf<-^Muj[^,T 



is handled in a similar way. When ETj — Tf > ^Mu}{—7=,J-) then it follows 
from (31) that 



p,i _ E{Tj-Tf)-E{T,^,-Tf) 



u;(^,J-„.F,)+Ma;(^,J-, 



(36) > 7Mu;(;^,-^)-^^(;^, -^i , -^j : 



> 



7-1 
Now without loss of generality suppose that i = 1 and that j = 2. Then if



^ ■ 



: 2 note that Z2^i < sup,^^]^(zi^,.) and since ^2 i — -^2,1 it follows that 



k 



P(i = 2) < ^ P(4,i - h,r < 0) 



r=2 



< E{^(4,1 - A,r < 0) + P(4,l - ^r,r. < 0)}. 



r=2 

Now by (32) both differences in the last display have normal distributions with
variance less than or equal to 4 and by (34)-(36) means greater than or equal
to $(\gamma - 3)/2$, and the lemma now follows from the bound on a standard normal
random variable $Z$,

$P(Z > t) \le \exp\!\left(-\dfrac{t^2}{2}\right),$

which holds for all $t > 0$. □
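
The Gaussian tail bound used at the end of this proof can be checked against exact tail probabilities. The sketch below assumes the bound in question is $P(Z > t) \le \exp(-t^2/2)$, which is the form consistent with the constant 32 in Lemma 2, and computes the exact tail from the complementary error function; the grid of $t$ values and the helper name are arbitrary choices.

```python
import math


def upper_tail(t):
    """P(Z > t) for a standard normal Z, via the complementary error function."""
    return 0.5 * math.erfc(t / math.sqrt(2))


ts = [0.1 * j for j in range(0, 81)]                       # t from 0 to 8
gaps = [math.exp(-t * t / 2) - upper_tail(t) for t in ts]  # bound minus exact tail
print(min(gaps), max(gaps))
```

Every entry of `gaps` is nonnegative on this grid, and the bound is loosest at $t = 0$, where it equals 1 against an exact value of 1/2.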

Although our main focus is on mean squared error we shall consider the 
more general case of pth power loss. Such general cases are important in 
the theory of adaptation [see Cai and Low (2002)]. Lemma 2 can be used 
to bound the risk of the estimator T* defined by (33) as in the following 
theorem. 

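Theorem 4. Suppose either the white noise model (1) or the sequence
model (2) is given. Let $\mathcal{F} = \bigcup_{i=1}^{k} \mathcal{F}_i$ where $k \ge 2$ and $\mathcal{F}_i$ are closed
convex sets with $\mathcal{F}_i \cap \mathcal{F}_j \ne \emptyset$ for all $i, j$. Let $\hat{T}^*$ be the estimator of
the linear functional $T$ defined as in (33). Then for $p \ge 1$,

(37) $\sup_{f \in \mathcal{F}} E|\hat{T}^* - Tf|^p \le C(p)\,M^p (\ln k)^{p/2}\,\omega^p\!\left(\dfrac{1}{\sqrt{n}},\mathcal{F}\right)$

where the constant $C(p)$ is independent of $M$, $k$ and $n$.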

Remark. Note that we can always find linear estimators $\hat{T}_i$ for which
$M \le 1$ in (29) and so the theorem yields an upper bound on the minimax
risk over $\mathcal{F}$ which only depends on the modulus and the number $k$. There is
also a minimax lower bound for the $p$th power loss analogous to that given
in (28),

(38) $\inf_{\hat{T}} \sup_{f \in \mathcal{F}} E|\hat{T} - Tf|^p \ge b(p)\,\omega^p\!\left(\dfrac{1}{\sqrt{n}},\mathcal{F}\right)$

for some constant $b(p) > 0$. By comparing the upper bound in (37) with this
bound it is clear that for fixed finite $k$ the estimator $\hat{T}^*$ is rate optimal over
$\mathcal{F}$. It is also worth noting that sometimes the lower bound is not asymptotically
sharp when $k$ is finite but grows with $n$. In Section 5 we give examples
where $k$ grows with $n$ and the optimal rate is given by the upper bound in
equation (37).

In the theory of adaptation the goal is to find a procedure which is simul- 
taneously near minimax over a collection of parameter spaces. If a collection 
of convex parameter spaces is not nested then the largest of the minimax 
risks for each convex parameter space may be smaller than the minimax 
risk over the union of the convex parameter spaces [see, e.g., Cai and Low 
(2002)]. In such cases an appropriate benchmark for the maximum risk of 
an adaptive estimator is given by the bound in Theorem 4. 

The proof of Theorem 4 is based on Lemma 2 and the following bound 
on the tail probabilities of a maximum of Gaussian random variables. 

Lemma 3. Let $X_i$, $i = 1, \ldots, m$, be normal random variables with means
$\mu_i$ and standard deviations $\sigma_i \le \sigma$. Suppose that $|\mu_i - \mu| \le \gamma$ for
$i = 1, \ldots, m$, and $c > 0$ is a constant. Then

(39) $P\!\left(\max_{1 \le i \le m} |X_i - \mu| \ge \gamma + \sqrt{c \ln m}\,\sigma\right) \le m^{1 - c/2}.$

Proof. We shall assume that $m \ge 2$ and that $c > 2$, since otherwise the
bound is trivial. Denote by $Z$ a standard Gaussian random variable. Then

$P\!\left(\max_{1 \le i \le m} |X_i - \mu| \ge \gamma + \sqrt{c \ln m}\,\sigma\right) \le \sum_{i=1}^{m} P\!\left(|X_i - \mu| \ge \gamma + \sqrt{c \ln m}\,\sigma\right)$

$\le m\,P\!\left(|Z| \ge \sqrt{c \ln m}\right)$

$\le m^{1 - c/2}.$

The last inequality follows from standard bounds on tail probabilities of
Gaussian distributions once we note that $c \ln m > 1$ when $c > 2$ and $m \ge 2$. □
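
The last step of this proof, $m\,P(|Z| \ge \sqrt{c \ln m}) \le m^{1-c/2}$, is easy to examine numerically. The following sketch evaluates both sides exactly for a few hypothetical values of $m$ and $c$, using the identity $P(|Z| \ge t) = \mathrm{erfc}(t/\sqrt{2})$; the helper names and the chosen values are illustrative.

```python
import math


def two_sided_tail(t):
    """P(|Z| >= t) for a standard normal Z."""
    return math.erfc(t / math.sqrt(2))


def union_bound(m, c):
    """m * P(|Z| >= sqrt(c ln m)), the quantity bounded in the last step."""
    return m * two_sided_tail(math.sqrt(c * math.log(m)))


checks = [
    (m, c, union_bound(m, c), m ** (1 - c / 2))
    for m in (2, 5, 50, 1000)
    for c in (2.5, 4.0, 8.0)
]
for m, c, lhs, rhs in checks:
    print(m, c, lhs, rhs)
```

In each row the union bound on the left is smaller than $m^{1-c/2}$ on the right, with the gap widening as $c \ln m$ grows.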



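Proof of Theorem 4. Let $\lambda \ge 1$ and $D_\lambda = 3 + \sqrt{32\lambda \ln 2k}$, and we
will write $D_p$ when $\lambda = p$. Then it is easy to check from Lemma 2 that if
$|E\hat{T}_i - Tf| \ge D_\lambda M \omega(\frac{1}{\sqrt{n}},\mathcal{F})$ then $P(\hat{i} = i) \le (2k)^{-(\lambda - 1)}$.

Let

$I_1 = \left\{i : |E\hat{T}_i - Tf| > D_p M \omega\!\left(\dfrac{1}{\sqrt{n}},\mathcal{F}\right)\right\},$

$I_2 = \left\{i : |E\hat{T}_i - Tf| \le D_p M \omega\!\left(\dfrac{1}{\sqrt{n}},\mathcal{F}\right)\right\}.$

We then have

$E|\hat{T}^* - Tf|^p = \sum_{i \in I_1} E\!\left(|\hat{T}_i - Tf|^p\,1(\hat{i} = i)\right) + \sum_{i \in I_2} E\!\left(|\hat{T}_i - Tf|^p\,1(\hat{i} = i)\right)$

(40) $\le \sum_{i \in I_1} \left(E|\hat{T}_i - Tf|^{2p}\right)^{1/2}\left(P(\hat{i} = i)\right)^{1/2} + E\!\left(\max_{i \in I_2} |\hat{T}_i - Tf|^p\right).$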

Now note that, if X has a normal distribution with mean /j, and variance a"^ 
and p>l, then a straightforward calculation shows that 

where aj = E\Z\^ for Z a standard Gaussian random variable. If i G /i, then 
for some X>p, 



and so for i G /i and such a choice of A > p, 

iE\f, - Tf\'P)'/\P{i = i)f^ < 2^AF(4/' + Z?^)^P(^-L,^^ __L_, 
Now note that, if A; > 2, 

J^P £)P DP 

and hence 

|2p\l/2/p/J_ •\a/2 ^ r,p, .p/ 1/2 ^«. ,,/ 1 



(41) 5^(i?|f- - r/|2P)V2(p(- = ^))i/2 < 2'PMP{ay;+Dl)u:H ^,T 

Let m be the cardinality of the set I2. Now note that, if m < 1, then 

(42) E{um^\fi-Tf\P^<B{p)u:Pl^-^,T 



and the theorem now follows from (40)-(42). If m > 2, then 

£;fmax|T;-r/|p' 



\i&l2 



<{Dp + V^]^fMPujP(^,T 



'n 



00 f 
x^UDp + VUnm)^ 

/=3 *- 
X p( (D„ + J(l - l)lnm)Aduj( -^,J'] <max\fi-Tf\

<{Dp + Vnnm)Mu;(^=,T 
<{Dp + V3lnmfMPujP(^,J' 



+ MPujP{^,J^ 



Z=3 "^ 






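Note that it follows from the definition of $I_2$ and the fact that the variance
of $\hat{T}_i$ is bounded by $M^2\omega^2(\frac{1}{\sqrt{n}},\mathcal{F})$ and Lemma 3 that

(43) $P\!\left(\max_{i \in I_2} |\hat{T}_i - Tf| \ge \left(D_p + \sqrt{(l-1)\ln m}\right) M \omega\!\left(\dfrac{1}{\sqrt{n}},\mathcal{F}\right)\right) \le m^{-(l-3)/2}.$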



Hence 



^(^max|f;-r/|p 

z=3 ^^^ 



< 



{Dp + V31nmf + Y^{Dp + V/lnm) 



< 5(p)(lnfcf /^^Fw^l'^,^ 



(44) 



The theorem now follows on combining (40), (41) and (44). D 

4. Linear estimators. We now consider the performance of linear
procedures. As mentioned in the Introduction, the optimal linear procedure
is within a small constant factor of the minimax risk when the parameter
space is convex. The following theorem considers the case when the parameter
space is nonconvex. Let $\mathcal{F}$ denote a parameter set and let $\mathrm{C.Hull}(\mathcal{F})$
denote the convex hull of $\mathcal{F}$.

Theorem 5. Consider the white noise model (1) or the sequence model
(2). The minimax linear risk over a parameter set $\mathcal{F}$ is the same as the
minimax linear risk over the convex hull of $\mathcal{F}$, that is,

$R_A^*(n;\mathcal{F}) = R_A^*(n;\mathrm{C.Hull}(\mathcal{F})).$

This theorem is a direct consequence of the following result.

Theorem 6. Let $\hat{T}$ be a linear estimator of $Tf$ where $T$ is a linear
functional. Then for any $\mathcal{F}$

(45) $\sup_{f \in \mathcal{F}} E_f(\hat{T} - Tf)^2 = \sup_{f \in \mathrm{C.Hull}(\mathcal{F})} E_f(\hat{T} - Tf)^2.$

Proof. Since $\mathcal{F} \subseteq \mathrm{C.Hull}(\mathcal{F})$, it is obvious that

$\sup_{f \in \mathcal{F}} E(\hat{T} - Tf)^2 \le \sup_{f \in \mathrm{C.Hull}(\mathcal{F})} E(\hat{T} - Tf)^2.$

Let $f \in \mathrm{C.Hull}(\mathcal{F})$ and $f = \sum_i \lambda_i f_i$ with $f_i \in \mathcal{F}$, $\lambda_i \ge 0$ and
$\sum_i \lambda_i = 1$. Then the squared bias

$(E_f\hat{T} - Tf)^2 = \left(\sum_i \lambda_i (E_{f_i}\hat{T} - Tf_i)\right)^2 \le \left(\sum_i \lambda_i |E_{f_i}\hat{T} - Tf_i|\right)^2$

$\le \max_i |E_{f_i}\hat{T} - Tf_i|^2 \le \sup_{f \in \mathcal{F}} (E_f\hat{T} - Tf)^2.$

It then follows from the fact that a linear estimator has constant variance
that

$\sup_{f \in \mathrm{C.Hull}(\mathcal{F})} E(\hat{T} - Tf)^2 \le \sup_{f \in \mathcal{F}} E(\hat{T} - Tf)^2.$ □
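
A one-dimensional example makes the content of Theorem 6 tangible. The sketch below assumes a single observation $Y \sim N(\theta, 1/n)$ with $T\theta = \theta$, the nonconvex two-point set $\{-1, 1\}$ and its convex hull $[-1, 1]$, and compares the maximum risk of several affine estimators $a + bY$ over the two sets. The setting, the estimator coefficients and the function name are hypothetical.

```python
def risk(a, b, theta, n):
    """Mean squared error of a + b * Y for theta when Y ~ N(theta, 1/n)."""
    squared_bias = (a + (b - 1.0) * theta) ** 2
    variance = b ** 2 / n
    return squared_bias + variance


n = 25
F = [-1.0, 1.0]                                       # nonconvex two-point set
hull = [-1.0 + 2.0 * j / 400 for j in range(401)]     # grid on the hull [-1, 1]
estimators = [(0.0, 1.0), (0.0, 0.5), (0.2, 0.8), (-0.3, 0.1)]

gaps = []
for a, b in estimators:
    max_over_F = max(risk(a, b, t, n) for t in F)
    max_over_hull = max(risk(a, b, t, n) for t in hull)
    gaps.append(max_over_hull - max_over_F)
print(gaps)
```

The gaps are all zero because the squared bias of an affine estimator is a convex function of $\theta$ and the variance does not depend on $\theta$, so the worst case over the hull sits at an extreme point.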

Note that equation (45) is not necessarily true for nonlinear procedures.
The following corollary is a direct consequence of Theorem 5.

Corollary 1.

(46) $R_A^*(n;\mathcal{F}) = \sup_{\epsilon > 0} \omega^2(\epsilon,\mathrm{C.Hull}(\mathcal{F}))\,\dfrac{1/(4n)}{1/n + \epsilon^2/4}$

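and

(47) $\dfrac{1}{8}\,\omega^2\!\left(\dfrac{1}{\sqrt{n}},\mathrm{C.Hull}(\mathcal{F})\right) \le R_A^*(n;\mathcal{F}) \le \omega^2\!\left(\dfrac{1}{\sqrt{n}},\mathrm{C.Hull}(\mathcal{F})\right).$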

Thus the minimax linear risk is determined by the modulus of continuity
over the convex hull of $\mathcal{F}$, not over $\mathcal{F}$ itself. In the case that
$\omega(\epsilon,\mathrm{C.Hull}(\mathcal{F})) \gg \omega(\epsilon,\mathcal{F})$, linear procedures will perform poorly.
Examples which illustrate this point are contained in the next section.
5. Examples. In this section we discuss examples where the modulus of
continuity over the convex hull of the parameter space is much larger than
the modulus of continuity over the parameter space. Since the performance
of the optimal linear procedure is determined by the modulus of the convex
hull of the parameter space, linear procedures perform badly in these cases.
On the other hand, the nonlinear procedure introduced in Section 3 is within
a constant factor of the minimax risk.

5.1. Estimating functions at a point. Suppose we observe the white noise
model (1) over the interval $[-\frac{1}{2}, \frac{1}{2}]$ and we wish to estimate $Tf = f(0)$.
We recall that a function is Lip($\alpha$) ($0 < \alpha \le 1$) over an interval $[a, b]$ if

$|f(x) - f(y)| \le |x - y|^{\alpha}$ for all $x, y \in [a, b]$.

Let

$\mathcal{F}_1 = \{f : f$ is continuous on $[-\frac{1}{2}, \frac{1}{2}]$
with maximum at 0 and $f$ is Lip(1) over $[-\frac{1}{2}, 0]\}$

and

$\mathcal{F}_2 = \{f : f$ is continuous on $[-\frac{1}{2}, \frac{1}{2}]$
with maximum at 0 and $f$ is Lip(2) over $[0, \frac{1}{2}]\}$.

Let $\mathcal{F} = \mathcal{F}_1 \cup \mathcal{F}_2$. The parameter spaces $\mathcal{F}_1$ and $\mathcal{F}_2$ are both convex,
but $\mathcal{F}$ is nonconvex. It is easy to see that

$\mathrm{C.Hull}(\mathcal{F}) = \{$All continuous functions over $[-\frac{1}{2}, \frac{1}{2}]$ with maximum at 0$\}$.

The convex hull of $\mathcal{F}$ is "much larger" than $\mathcal{F}$. By straightforward
calculations it is easy to verify that for $Tf = f(0)$ and small $\epsilon > 0$,

a;(e,.Fi)=a;(e,.^2,^i) = 3^/^/3, 

a;(e,.7^2) =^(£,^1,^2) = 2i/V/2(l + 0(1)) 

so uj{e,T) = 2^/^e^/^{l + o(l)). But w(e, C.Hull(J^)) = 00. 

It follows from Theorem 4 that the minimax mean squared error rate of 
convergence for estimating the linear functional Tf = /(O) is n~^". How- 
ever, the maximum risk of any linear estimator over J^ is not even bounded. 
[This follows from the fact that a;(e, C.Hull(jr)) = cx).] In other words, linear 
estimators do not work at all in this case. 
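
As a rough numerical check of the first modulus value, the sketch below assumes that the extremal difference between two members of $\mathcal{F}_1$ is the one-sided triangle $h(x) = \max(a - |x|, 0)$ on $[-\frac{1}{2}, 0]$. This choice of $h$ and the function name are illustrative assumptions, not taken from the paper. Setting $\|h\|_2 = \epsilon$ gives $a^3/3 = \epsilon^2$, that is, $a = 3^{1/3}\epsilon^{2/3}$.

```python
import math

def triangle_l2_norm(a: float, steps: int = 200_000) -> float:
    """L2 norm of h(x) = max(a - |x|, 0) on [-1/2, 0], by the midpoint rule."""
    width = min(a, 0.5)          # h vanishes beyond distance a from 0
    dx = width / steps
    total = sum((a - (i + 0.5) * dx) ** 2 for i in range(steps)) * dx
    return math.sqrt(total)

eps = 0.01
a = 3 ** (1 / 3) * eps ** (2 / 3)   # claimed value of omega(eps, F_1)
norm = triangle_l2_norm(a)          # should reproduce eps
```

With $\epsilon = 0.01$ the height $a$ is well below $\frac{1}{2}$, so the triangle fits inside the interval and the computed norm agrees with $\epsilon$ up to quadrature error.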

5.2. Estimating a linear functional of nearly black objects. In this example
we consider the Gaussian sequence model

(48) $y_i = f_i + n^{-1/2} z_i, \quad i = 1, \ldots, n,$

where $z_i \stackrel{\mathrm{i.i.d.}}{\sim} N(0,1)$. The size of the vector, $n$, is assumed large; we are
interested in asymptotics in which the number of variables is large. We
assume that the vector $f$ is sparse: only a small fraction of components are
nonzero, and the indices, or locations of the nonzero components are not
known in advance.

Denote the $\ell_0$ quasi-norm by $\|f\|_0 = \mathrm{Card}(\{i : f_i \ne 0\})$. Fix $k_n$; the collection
of vectors with at most $k_n$ nonzero entries is

$\mathcal{F} = \ell_0(k_n) = \{f \in \mathbb{R}^n : \|f\|_0 \le k_n\}.$

Following Donoho, Johnstone, Hoch and Stern (1992), we call a setting
nearly black when the fraction of nonzero components $k_n/n \approx 0$, by analogy
with night-sky images. In this example we assume that $k_n$ is known and
$k_n \le Cn^{\epsilon}$ where $\epsilon < 1/2$.

A motivation for this model is provided by wavelet analysis, since the 
wavelet representation of many smooth and piecewise smooth signals is 
sparse and nearly black in this sense [see, e.g., Donoho, Johnstone, Kerky- 
acharian and Picard (1995)]. For estimating the whole object, this model 
has also been studied in Donoho, Johnstone, Hoch and Stern (1992) and 
Abramovich, Benjamini, Donoho and Johnstone (2000). 

In the present paper we are interested in estimating the linear functional
of the unknown vector $f$ given by

$Tf = \sum_{j=1}^{n} f_j.$

Let $\mathcal{I}(k,n)$ be the class of all subsets of $\{1, \ldots, n\}$ of $k$ elements and for
$I \in \mathcal{I}(k,n)$ let

$\mathcal{F}_I = \{f \in \mathbb{R}^n : f_j = 0 \ \forall j \notin I\}.$

Note that $\mathcal{F}_I$ is a $k_n$-dimensional subspace spanned by the coordinates in $I$.
These are obviously convex and $\mathcal{F} = \bigcup \mathcal{F}_I$ where the union is taken over $I$ in
the set $\mathcal{I}(k_n,n)$. From now on we shall assume that $I$ is in the set $\mathcal{I}(k_n,n)$.
Linear procedures perform poorly over $\mathcal{F}$. In fact it is easy to see that the
convex hull of $\mathcal{F}$ is the whole of $\mathbb{R}^n$ and

$\omega(\epsilon, \mathrm{C.Hull}(\mathcal{F})) = \omega(\epsilon, \mathbb{R}^n) = \sqrt{n}\,\epsilon.$

It then follows from Theorem 5 that any linear estimator must have maximum
mean squared error over $\mathcal{F}$ of at least 1. In fact it is easy to see that
the best linear procedure is simply $\hat{T} = \sum_{i=1}^{n} y_i$.
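
To see where the value 1 comes from, one can plug the modulus $\omega(\epsilon, \mathbb{R}^n) = \sqrt{n}\,\epsilon$ into the supremum formula of Corollary 1, $\sup_{\epsilon>0} \omega^2(\epsilon)\,(1/(4n))/(1/n + \epsilon^2/4)$. The sketch below (function and variable names are illustrative assumptions) evaluates the objective on a grid of $\epsilon$ values and also records the risk of the sum estimator, which is unbiased with $n$ independent errors of variance $1/n$.

```python
import math

def linear_risk_objective(eps: float, n: int) -> float:
    """omega^2(eps) * (1/(4n)) / (1/n + eps^2/4) with omega(eps) = sqrt(n) * eps."""
    omega_sq = n * eps ** 2
    return omega_sq * (1 / (4 * n)) / (1 / n + eps ** 2 / 4)

n = 100
grid = [10 ** (j / 10) for j in range(-40, 61)]   # eps from 1e-4 to 1e6
values = [linear_risk_objective(e, n) for e in grid]
sup_on_grid = max(values)                         # approaches 1 as eps grows

# Risk of the unbiased estimator sum(y_i): n independent errors of variance 1/n.
sum_estimator_risk = n * (1 / n)
```

The objective simplifies to $n\epsilon^2/(4 + n\epsilon^2)$, which increases to 1 as $\epsilon \to \infty$, matching the risk of the sum estimator.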

Nonlinear procedures can perform much better. Our general construction
given in Section 3 starts with linear estimators constructed assuming that
$f \in \mathcal{F}_I$. In this case it is natural to start with $\hat{T}_I$, the minimax estimator over
$\mathcal{F}_I$, since this estimator is linear, unbiased over $\mathcal{F}_I$ and has variance equal
to $\frac{k_n}{n}$. $\hat{T}_I$ is in fact just the sum of $y_j$ with $j \in I$. In this example equation
(29) holds for all $I \in \mathcal{I}(k_n,n)$ with $M = 1$ and $\omega^2(\frac{1}{\sqrt{n}}, \mathcal{F}_I) = \frac{k_n}{n}$.

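The construction of $\hat{T}^*$ is also based on the modulus $\omega(\epsilon, \mathcal{F}_I, \mathcal{F}_J)$ between
$\mathcal{F}_I$ and $\mathcal{F}_J$. Note that a least favorable pair of parameters is given by one
parameter which has the $k_n$ coefficients in $J$ all equal to some given value
$a > 0$ and the rest zero and the second parameter has the coefficients in $I \setminus J$
equal to $-a$ and the rest zero. By choosing $a$ so that the $\ell_2$ distance between
these parameters is equal to $\epsilon$ it is easy to check that

$\omega(\epsilon, \mathcal{F}_I, \mathcal{F}_J) = \sqrt{\mathrm{Card}(I \cup J)}\,\epsilon$

and consequently, since $\mathrm{Card}(I \cup J) \le 2k_n$, $\omega(\epsilon, \mathcal{F}_I, \mathcal{F}_J) \le \sqrt{2k_n}\,\epsilon$.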

Now let $\hat{T}_{I,J}$ be defined as in Section 3. It is easy to see in this case that

$\hat{T}_{I,J} = \sum_{i \in I \cup J} y_i.$
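
The least favorable pair described above can be written out directly. The sketch below uses 0-based index sets and hypothetical helper names. It builds the two parameters, checks that their $\ell_2$ distance is $\epsilon$ and that the difference of the functional values equals $\sqrt{\mathrm{Card}(I \cup J)}\,\epsilon$, and forms the sum of the observations over $I \cup J$.

```python
import math

def least_favorable_pair(I: set, J: set, n: int, eps: float):
    """Return (f, g): f in F_I equals -a on I minus J, g in F_J equals a on J."""
    a = eps / math.sqrt(len(I | J))     # makes the l2 distance equal to eps
    f = [(-a if i in I - J else 0.0) for i in range(n)]
    g = [(a if i in J else 0.0) for i in range(n)]
    return f, g

def t_hat_union(y, I: set, J: set) -> float:
    """Sum of the observations y_i over the index set I union J."""
    return sum(y[i] for i in I | J)

n, eps = 10, 0.3
I, J = {0, 1, 2}, {2, 3, 4}
f, g = least_favorable_pair(I, J, n, eps)
distance = math.sqrt(sum((gi - fi) ** 2 for fi, gi in zip(f, g)))
gap = sum(g) - sum(f)                   # T(g) - T(f) for T f = sum of f_i
```

Here $\mathrm{Card}(I \cup J) = 5$, so the gap is $\sqrt{5}\,\epsilon$ while the two parameters are exactly $\epsilon$ apart.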

Let $N$ be the number of parameter spaces. Then $N$ is equal to $n$ choose
$k_n$ and it is easy to see that

$N = \binom{n}{k_n} \le n^{k_n}.$



It then follows from Theorem 4 that if $\hat{T}^*$ is defined by (33), then

(49) $\sup_{f \in \mathcal{F}} E(\hat{T}^* - Tf)^2 \le C\,\frac{k_n^2}{n}\log n.$

The following theorem shows that the estimator $\hat{T}^*$ is in fact rate optimal.
The theorem gives a minimax lower bound based on using a mixture prior
and a constrained risk inequality introduced in Brown and Low (1996).

Theorem 7. Let $Tf = \sum_{i=1}^{n} f_i$. Suppose that $n \ge 4$ and that $k_n \le n^{\epsilon}$
with $\epsilon < 1/2$. Then

(50) $\inf_{\hat{T}} \sup_{f \in \mathcal{F}} E(\hat{T} - Tf)^2 \ge \frac{1}{121}\,\frac{k_n^2}{n}\,\ln\!\left(\frac{n}{k_n^2}\right).$

Remark. Comparing the minimax lower bound (50) with the risk upper
bound for $\hat{T}^*$, for $k_n \le Cn^{\epsilon}$ with $\epsilon < 1/2$, the estimator $\hat{T}^*$ is within a
constant factor of the minimax risk. For example, for $k_n = n^{\epsilon}$ with $\epsilon < 1/2$,
the risk of $\hat{T}^*$ converges at the rate of $n^{-(1-2\epsilon)} \log n$ which is optimal.
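
As a quick illustration of the remark, the sketch below compares an upper-bound rate of the form $(k_n^2/n)\log n$ with the lower bound $\frac{1}{121}(k_n^2/n)\ln(n/k_n^2)$ for $k_n = n^{\epsilon}$. The constant $C$ in the upper bound is not specified in the text and is set to 1 here purely as an assumption; the function names are also illustrative. The point is only that the ratio stays bounded: for $k_n = n^{\epsilon}$ exactly it equals $121/(1 - 2\epsilon)$.

```python
import math

def upper_rate(n: float, k: float) -> float:
    """Upper-bound rate (k^2 / n) * log n, with the unknown constant C set to 1."""
    return k ** 2 / n * math.log(n)

def lower_bound(n: float, k: float) -> float:
    """Lower bound (1/121) * (k^2 / n) * ln(n / k^2)."""
    return k ** 2 / n * math.log(n / k ** 2) / 121

exponent = 0.25                       # k_n = n^exponent with exponent < 1/2
ratios = []
for n in (1e4, 1e6, 1e8):
    k = n ** exponent                 # treated as a real number for simplicity
    ratios.append(upper_rate(n, k) / lower_bound(n, k))
```

With exponent 0.25 the ratio is $121/(1 - 0.5) = 242$ for every $n$, so the two bounds differ only by a constant factor.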
Proof of Theorem 7. In the proof we will omit the subscript in $k_n$
and simply write $k$ for $k_n$. Let $\psi_{\mu}$ be the density of a normal distribution
with mean $\mu$ and variance $\frac{1}{n}$. And for $I \in \mathcal{I}(k,n)$ let

$g_I(y_1, \ldots, y_n) = \prod_{j=1}^{n} \psi_{f_j}(y_j)$

where $f_j = \frac{\rho}{\sqrt{n}}\,1(j \in I)$. Finally let

$g = \binom{n}{k}^{-1} \sum_{I \in \mathcal{I}(k,n)} g_I$

and $f = \prod_{j=1}^{n} \psi_0(y_j)$ be the density of $n$ independent normal random variables
each with mean 0 and variance $\frac{1}{n}$. Note that a similar mixture prior was used
in Baraud (2000) to give lower bounds in a nonparametric testing problem.
Now note that if

$E_{g_I}\left(\delta - \frac{k\rho}{\sqrt{n}}\right)^2 \le C$

for all $I \in \mathcal{I}(k,n)$ then it follows that

$E_g\left(\delta - \frac{k\rho}{\sqrt{n}}\right)^2 \le C.$

We will now apply the constrained risk inequality of Brown and Low (1996).
First we need to calculate a chi-squared distance between $f$ and $g$. This is
done as follows. Note that

$\int \frac{g^2}{f} = \binom{n}{k}^{-2} \sum_{I \in \mathcal{I}(k,n)} \sum_{I' \in \mathcal{I}(k,n)} \int \frac{g_I\, g_{I'}}{f}$

and simple calculations show that

$\int \frac{g_I\, g_{I'}}{f} = \exp(j\rho^2)$

where $j$ is the number of points in the set $I \cap I'$. It follows that

$\int \frac{g^2}{f} = E \exp(J\rho^2)$

where $J$ has a hypergeometric distribution

$P(J = j) = \frac{\binom{k}{j}\binom{n-k}{k-j}}{\binom{n}{k}}.$

Now note that from Feller [(1968), page 59],

$P(J = j) \le \binom{k}{j}\left(\frac{k}{n}\right)^{j}\left(1 - \frac{k}{n}\right)^{k-j}\left(1 - \frac{k}{n}\right)^{-k}.$
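Now suppose that $n \ge 4$ and that $k \le n^{1/2}$. Then

$\left(1 - \frac{k}{n}\right)^{-k} \le 4^{k^2/n} \le 4$

and hence

$P(J = j) \le 4\binom{k}{j}\left(\frac{k}{n}\right)^{j}\left(1 - \frac{k}{n}\right)^{k-j}.$

It now follows that if $n \ge 4$ and $k \le n^{1/2}$ then

$\int \frac{g^2}{f} = E \exp(J\rho^2) \le 4\left(1 - \frac{k}{n} + \frac{k}{n}e^{\rho^2}\right)^{k}.$

Now take $\rho = \sqrt{\ln\frac{n}{k^2}}$ and it follows that

$\int \frac{g^2}{f} \le 4\left(1 - \frac{k}{n} + \frac{1}{k}\right)^{k} \le 4\left(1 + \frac{1}{k}\right)^{k} \le 4e.$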
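
The chain of inequalities above can be checked numerically for particular $n$ and $k$. The following sketch (helper names are assumptions) computes $E\exp(J\rho^2)$ exactly from the hypergeometric distribution of $J = \mathrm{Card}(I \cap I')$ with $\rho^2 = \ln(n/k^2)$, and compares it with the binomial-type bound $4(1 - k/n + (k/n)e^{\rho^2})^k$ and with $4e$.

```python
import math

def chi_square_affinity(n: int, k: int) -> float:
    """E exp(J * rho^2) for J hypergeometric (k draws, k marked, n total)."""
    rho_sq = math.log(n / k ** 2)
    total = math.comb(n, k)
    return sum(
        math.comb(k, j) * math.comb(n - k, k - j) / total * math.exp(j * rho_sq)
        for j in range(k + 1)
    )

def binomial_bound(n: int, k: int) -> float:
    """4 * (1 - k/n + (k/n) * exp(rho^2))^k with rho^2 = ln(n / k^2)."""
    rho_sq = math.log(n / k ** 2)
    return 4 * (1 - k / n + k / n * math.exp(rho_sq)) ** k

# Each pair satisfies n >= 4 and k <= sqrt(n), as required in the proof.
checks = [(n, k, chi_square_affinity(n, k), binomial_bound(n, k))
          for n, k in [(4, 2), (100, 5), (400, 5), (10_000, 50)]]
```

In each case the exact value sits below the binomial-type bound, which in turn is at most $4e$; the exact value is typically much smaller, so the constant 4 is conservative.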

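It then follows from the constrained risk inequality in Brown and Low (1996)
that if

(51) $E_f(\delta - 0)^2 \le c_1\,\frac{k^2}{n}\ln\frac{n}{k^2}$

then

(52) $E_g\left(\delta - \frac{k}{\sqrt{n}}\sqrt{\ln\frac{n}{k^2}}\right)^2 \ge \frac{k^2}{n}\ln\frac{n}{k^2} - 4e\,\frac{k}{\sqrt{n}}\sqrt{\ln\frac{n}{k^2}}\,\sqrt{c_1\,\frac{k^2}{n}\ln\frac{n}{k^2}} = \frac{k^2}{n}\ln\frac{n}{k^2}\,(1 - 4e\sqrt{c_1}).$

The theorem now follows on taking $c_1 = 1 + 8e^2 - 4e\sqrt{1 + 4e^2}$. □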
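
The constant $c_1$ is the square of the positive root $t$ of $t^2 + 4et - 1 = 0$, so that $c_1 = 1 - 4e\sqrt{c_1}$ and the bounds in (51) and (52) coincide. A short numerical check (variable names are illustrative) confirms this and shows that $c_1$ lies just above $1/121$, the constant appearing in (50).

```python
import math

e = math.e
c1 = 1 + 8 * e ** 2 - 4 * e * math.sqrt(1 + 4 * e ** 2)

# Positive root of t^2 + 4 e t - 1 = 0, written in a cancellation-free form.
t = 1 / (2 * e + math.sqrt(1 + 4 * e ** 2))

# c1 should satisfy the fixed-point relation c1 = 1 - 4 e sqrt(c1).
fixed_point_gap = abs(c1 - (1 - 4 * e * math.sqrt(c1)))
```

Numerically $c_1$ is between $1/121$ and $1/120$, so $1/121$ is the largest reciprocal of an integer that this argument supports.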

5.3. Structured nearly black objects. We will now consider an example
under the Gaussian sequence model (48) where most of the coordinates are
zero but where we shall also assume that the $k_n$ nonzero coordinates appear
consecutively and that $0 < k_n \le n^{\epsilon}$ for some $\epsilon < 1$. Again $k_n$ is assumed to
be known. Let

$\mathcal{F}(a, k_n) = \{f \in \mathbb{R}^n : f_i = 0 \text{ unless } a \le i \le a + k_n - 1\}$

and

$\mathcal{F} = \bigcup_{a=1}^{n-k_n} \mathcal{F}(a, k_n).$

We call members of $\mathcal{F}$ structured nearly black objects. It is easy to see that
the convex hull of $\mathcal{F}$ is again the whole of $\mathbb{R}^n$. It thus follows from Theorem
5 that linear procedures perform poorly for estimating $Tf$ over $\mathcal{F}$.
Let $\hat{T}_a = \sum_{i=a}^{a+k_n-1} y_i$. Then $\hat{T}_{a,b}$ as defined in Section 3 is given by

$\hat{T}_{a,b} = \sum_{i} y_i\,1(i \in [a, a + k_n - 1] \cup [b, b + k_n - 1]).$
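
A minimal sketch of these window estimators follows. It uses 1-based window starts $a$ and $b$ as in the text but a 0-based Python list for the data; the function names are hypothetical.

```python
def window(a: int, k: int) -> set:
    """Indices {a, ..., a + k - 1} (1-based) covered by the window starting at a."""
    return set(range(a, a + k))

def t_hat_a(y, a: int, k: int) -> float:
    """Sum of y_i over a <= i <= a + k - 1."""
    return sum(y[i - 1] for i in window(a, k))

def t_hat_ab(y, a: int, b: int, k: int) -> float:
    """Sum of y_i over the union of the windows starting at a and at b."""
    return sum(y[i - 1] for i in window(a, k) | window(b, k))

y = [0.1, 2.0, 1.5, 0.0, -0.2, 0.3]
single = t_hat_a(y, 2, 2)        # y_2 + y_3
overlap = t_hat_ab(y, 2, 3, 2)   # windows {2,3} and {3,4}: y_2 + y_3 + y_4
```

Taking the union of the two windows means an observation in the overlap is counted once, which is what the indicator in the displayed formula expresses.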

Note that $\mathcal{F}$ is a union of only $n - k_n$ convex sets and so it then follows
from Theorem 4 that if $\hat{T}^*$ is now defined by (33) then

(53) $\sup_{f \in \mathcal{F}} E(\hat{T}^* - Tf)^2 \le C\,\frac{k_n}{n}\log n.$

Equation (53) gives an upper bound for the minimax risk. We shall now
show that this upper bound is rate sharp. In fact we shall show that if $n \ge 4$
and $k_n \le n^{\epsilon}$ with $\epsilon < 1$, then

(54) infsupi?(f-r/)2>l^lnf^ 

This can be seen as follows. Denote the index sets $I_a = \{i : a \le i \le a + k_n - 1\}$
and let $\mathcal{I}(k_n, n) = \bigcup_{a=1}^{n-k_n} \{I_a\}$. As in the previous example let $\psi_{\mu}$ be the density
of a normal distribution with mean $\mu$ and variance $\frac{1}{n}$. And for $I \in \mathcal{I}(k_n,n)$
let

$g_I(y_1, \ldots, y_n) = \prod_{j=1}^{n} \psi_{f_j}(y_j)$

where $f_j = \frac{\rho}{\sqrt{n}}\,1(j \in I)$. Finally let $g = \frac{1}{n-k_n}\sum_{a=1}^{n-k_n} g_{I_a}$ and $f = \prod_{j=1}^{n} \psi_0(y_j)$ be
the density of $n$ independent normal random variables each with mean 0
and variance $\frac{1}{n}$. Following the argument in the previous example we note
that

$\int \frac{g^2}{f} = E \exp(J\rho^2)$



where this time J satisfies 



and for 1 <i <kn, 



PiJ = o)- "■ 



n 



P{J = i) = -. 
n 
Hence



/ 

Now set 



/g2 J^ 

— < 1 + — exp(fc„/9^). 




2 

Then $\int g^2/f \le 4$ and (54) now follows as in (51) and (52).

Acknowledgment. We thank two referees for their thorough and useful 
comments which have helped to improve the presentation of the paper. 

REFERENCES 

[1] Abramovich, F., Benjamini, Y., Donoho, D. and Johnstone, I. (2000). Adapting
to unknown sparsity by controlling the false discovery rate. Technical report 2000-19,
Dept. Statistics, Stanford Univ.

[2] Baraud, Y. (2000). Nonasymptotic minimax rates of testing in signal detection.
Technical report, Ecole Normale Superieure.

[3] Brown, L. D. and Low, M. G. (1996). A constrained risk inequality with applications
to nonparametric functional estimation. Ann. Statist. 24 2524-2535. MR1425965

[4] Cai, T. and Low, M. (2002). On modulus of continuity and adaptability in
nonparametric functional estimation. Technical report, Dept. Statistics, Univ. Pennsylvania.

[5] Donoho, D. L. (1994). Statistical estimation and optimal recovery. Ann. Statist. 22
238-270. MR1272082

[6] Donoho, D. L., Johnstone, I. M., Hoch, J. C. and Stern, A. S. (1992). Maximum
entropy and the nearly black object (with discussion). J. Roy. Statist. Soc. Ser.
B 54 41-81. MR1157714

[7] Donoho, D. L., Johnstone, I. M., Kerkyacharian, G. and Picard, D. (1995).
Wavelet shrinkage: Asymptopia? J. Roy. Statist. Soc. Ser. B 57 301-369. MR1323344

[8] Donoho, D. L. and Liu, R. C. (1991a). Geometrizing rates of convergence. II. Ann.
Statist. 19 633-667. MR1105839

[9] Donoho, D. L. and Liu, R. C. (1991b). Geometrizing rates of convergence. III.
Ann. Statist. 19 668-701. MR1105839

[10] Efromovich, S. and Low, M. G. (1994). Adaptive estimates of linear functionals.
Probab. Theory Related Fields 98 261-275. MR1258989

[11] Feller, W. (1968). An Introduction to Probability Theory and Its Applications 1,
3rd ed. Wiley, New York. MR228020

[12] Ibragimov, I. A. and Has'minskii, R. Z. (1984). Nonparametric estimation of the
value of a linear functional in Gaussian white noise. Theory Probab. Appl. 29 18-32.
MR739497

[13] Lepski, O. V. (1990). On a problem of adaptive estimation in Gaussian white noise.
Theory Probab. Appl. 35 454-466. MR1091202

[14] Low, M. G. (1995). Bias-variance tradeoffs in functional estimation problems. Ann.
Statist. 23 824-835. MR1345202






Department of Statistics
The Wharton School
University of Pennsylvania
Philadelphia, Pennsylvania 19104-6340
USA
E-MAIL: tcai@wharton.upenn.edu
E-MAIL: lowm@wharton.upenn.edu



